Data storage is a central issue for researchers and others facing the daunting challenge of interpreting vast genomic and other datasets in the terabyte and petabyte range and beyond. Collaboration and the critical sharing of data, such as between research and clinical facilities (both local and remote), require the handling of a huge influx of datasets. They also require strategies for lossless data compression that enable efficient and accurate comparison against data such as log files or a reference genome, which are federally required for audit and reporting purposes. Storage and secure transfer of such data have been accomplished by sending physical hard drives to local or distant locations, which is cumbersome, expensive, and increasingly unsustainable.
Freely available research-based tools exist, as do other applications and publications, in fields related to algorithms and methods for the selective compression, coding, indexing, annotation, mapping, and alignment of large data files, including, but not limited to, repetitive sequence collections, text, binary text images, and databases.
Rapid improvements in high-throughput next-generation sequencing technologies yield genomic datasets (both complete genomes and population-type data) that are accumulating at an exponential rate. Publicly available genomic datasets are typically stored as flat text files, which places increasing burdens on digital storage, analysis, and secure transmission. Storing, sharing, analyzing, or remotely downloading large data files, such as genetic information or other Big Data, is laborious and nearly impossible for many institutions, especially those in parts of the world lacking high-speed internet access. Today, large storage sites spend millions of dollars on storage, and massive data transfer remains an imposing burden on servers and internet networks. There exists an urgent technical need for a solution that enables the storage, sharing, and transmission of large datasets without loss or deterioration of the data. With present technologies, doing so is at best laborious and, in many cases, nearly impossible for entities such as research centers, hospitals, and other institutions.